In the current landscape of the film industry, data analysis has become an essential tool for grasping market trends and predicting the success of movies. In this data analysis endeavor, our aim is to delve into an extensive film database sourced from The Movie Database (TMDB) to pinpoint and categorize the standout films of the past five years.
This data analysis project is geared towards offering a comprehensive and updated perspective on the cinematic scene. Leveraging data analysis techniques in R, our objective is to identify the top-performing movies from the last five years and gain insights into the factors underpinning their success.
In this data analysis project, we will utilize a comprehensive database containing over 1 million movies (1008166), sourced from Full Movies TMDB 2024. The dataset comprises the following variables:
TMDB_movie_dataset_v11.csv| Variable | Description |
|---|---|
id |
Unique identifier for each movie |
vote_count |
Number of votes received by the movie |
revenue |
Revenue generated by the movie |
backdrop_path |
Path to the backdrop image of the movie |
imdb_id |
IMDb identifier of the movie |
overview |
Summary or plot synopsis of the movie |
tagline |
Promotional tagline of the movie |
production_countries |
Countries where the movie was produced |
title |
Title of the movie |
status |
Production status of the movie |
runtime |
Duration in minutes of the movie |
budget |
Production budget of the movie |
original_language |
Original language of the movie |
popularity |
Popularity index of the movie |
genres |
Genres of the movie |
spoken_languages |
Spoken languages in the movie |
vote_average |
Average rating of the movie by users |
release_date |
Release date of the movie |
adult |
Indicator of whether the movie is for adults |
homepage |
Official homepage of the movie |
original_title |
Original title of the movie |
poster_path |
Path to the poster image of the movie |
production_companies |
Production companies involved in the movie |
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(psych)
## Warning: package 'psych' was built under R version 4.2.3
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
setwd("~/Proyectos R/archive (5)")
TMDB_data <- read.csv("TMDB_movie_dataset_v11.csv")
head(TMDB_data, 10)
glimpse(TMDB_data)
## Rows: 1,008,166
## Columns: 23
## $ id <int> 27205, 157336, 155, 19995, 24428, 293660, 299536,…
## $ title <chr> "Inception", "Interstellar", "The Dark Knight", "…
## $ vote_average <dbl> 8.364, 8.417, 8.512, 7.573, 7.710, 7.606, 8.255, …
## $ vote_count <int> 34495, 32571, 30619, 29815, 29166, 28894, 27713, …
## $ status <chr> "Released", "Released", "Released", "Released", "…
## $ release_date <chr> "2010-07-15", "2014-11-05", "2008-07-16", "2009-1…
## $ revenue <dbl> 825532764, 701729206, 1004558444, 2923706026, 151…
## $ runtime <int> 148, 169, 152, 162, 143, 108, 149, 139, 121, 154,…
## $ adult <chr> "False", "False", "False", "False", "False", "Fal…
## $ backdrop_path <chr> "/8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg", "/pbrkL804c8y…
## $ budget <int> 160000000, 165000000, 185000000, 237000000, 22000…
## $ homepage <chr> "https://www.warnerbros.com/movies/inception", "h…
## $ imdb_id <chr> "tt1375666", "tt0816692", "tt0468569", "tt0499549…
## $ original_language <chr> "en", "en", "en", "en", "en", "en", "en", "en", "…
## $ original_title <chr> "Inception", "Interstellar", "The Dark Knight", "…
## $ overview <chr> "Cobb, a skilled thief who commits corporate espi…
## $ popularity <dbl> 83.952, 140.241, 130.643, 79.932, 98.082, 72.735,…
## $ poster_path <chr> "/oYuLEt3zVCKq57qu2F8dT7NIa6f.jpg", "/gEU2QniE6E7…
## $ tagline <chr> "Your mind is the scene of the crime.", "Mankind …
## $ genres <chr> "Action, Science Fiction, Adventure", "Adventure,…
## $ production_companies <chr> "Legendary Pictures, Syncopy, Warner Bros. Pictur…
## $ production_countries <chr> "United Kingdom, United States of America", "Unit…
## $ spoken_languages <chr> "English, French, Japanese, Swahili", "English", …
Upon examination, our database boasts an impressive count of precisely 1,008,166 movies. Furthermore, we’re equipped with a rich array of over 20 variables for comprehensive analysis.
Variable Types Breakdown:
chr (Character): Comprising 16 variables
int (Integer): Encompassing 4 variables
dbl (Double): Incorporating 3 variables
missing_values <- colSums(is.na(TMDB_data))
tibble(Variables = names(missing_values), missing_values = missing_values)
Great news! Our dataset is clean and free of any missing values. We’re all set to dive into our analysis with confidence!
numeric_variable <- TMDB_data %>%
select_if(is.numeric)
describe(numeric_variable)
Interesting, firstly, we notice that both revenue and
runtime have negative values, which is entirely impossible
in movies. revenue also has many values equal to 0, which
is logical because nowadays a large portion of movies released do not
exclusively premiere in big cinemas but also on platforms like Netflix,
HBO MAX, Disney+, etc. This makes it very complicated to calculate movie
revenues. Additionally, the maximum value of runtime is an
enormous number, which also lacks logic.
We also observe:
Let’s attempt to correct these outliers:
Outliers_values <- TMDB_data %>%
filter( title %in% c("Adventures in Bora Bora", "The Bear") | title == "Untitled" | original_title == "Untitled" | status == "Canceled" | status == "Rumored" | status == "Planned" | revenue < 0 | runtime < 0 )
#checking revenue
TMDB_data <- TMDB_data %>%
filter(!revenue < 0, !title %in% c("Adventures in Bora Bora", "The Bear"))
#checking runtime
TMDB_data <- TMDB_data %>%
filter(!runtime < 0)
#checking title without names
TMDB_data <- TMDB_data %>%
filter(!title == "Untitled")
#checking Films canceled, Rumored and Planned
TMDB_data <- TMDB_data %>%
filter(!status %in% c("Canceled", "Rumored", "Planned"))
Outliers_values
Previously, we observed a significant number of character-type variables. Let’s proceed by transforming these variables into categorical variables to facilitate subsequent movie analysis.
TMDB_data_pre <- TMDB_data %>%
select(-release_date) %>%
mutate_if(is.character, as.factor)
summary(TMDB_data_pre)
## id title vote_average vote_count
## Min. : 2 Home : 141 Min. : 0.000 Min. : 0.00
## 1st Qu.: 370952 Alone : 89 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 665900 Mother : 83 Median : 0.000 Median : 0.00
## Mean : 664662 Love : 70 Mean : 2.102 Mean : 21.39
## 3rd Qu.: 975640 The Gift: 70 3rd Qu.: 5.000 3rd Qu.: 1.00
## Max. :1260267 The Box : 68 Max. :10.000 Max. :34495.00
## (Other) :1001981
## status revenue runtime adult
## In Production : 5909 Min. :0.000e+00 Min. : 0.0 False:908922
## Post Production: 5381 1st Qu.:0.000e+00 1st Qu.: 1.0 True : 93580
## Released :991212 Median :0.000e+00 Median : 30.0
## Mean :7.388e+05 Mean : 50.7
## 3rd Qu.:0.000e+00 3rd Qu.: 90.0
## Max. :2.930e+09 Max. :14400.0
##
## backdrop_path budget
## :717554 Min. : 0
## None : 883 1st Qu.: 0
## /3CxwYgqGtJ6UEGfWUT0gMYCIlFP.jpg: 157 Median : 0
## /pOXuMdKnWO9hK8drDahJVQxILHx.jpg: 66 Mean : 289206
## /6r2onqJ2S7XhtnU3HbvNmEv8SXK.jpg: 47 3rd Qu.: 0
## /60t9ckELUGyinDQNDULFDq2i7u7.jpg: 44 Max. :900000000
## (Other) :283751
## homepage imdb_id original_language
## :893656 :430416 en :538915
## https://animation.geidai.ac.jp: 147 None : 832 fr : 60589
## http://www.eldoradofilms.com : 73 tt13904644: 18 es : 51724
## https://www.yamidouga.com/ : 63 tt26900526: 8 de : 50974
## https://uaebabes.com : 57 tt23810972: 7 ja : 43919
## http://www.ufc.com : 52 tt27048168: 7 pt : 28185
## (Other) :108454 (Other) :571214 (Other):228196
## original_title overview popularity
## Home : 91 :182271 Min. : 0.000
## Carmen : 62 Mexican feature film: 912 1st Qu.: 0.600
## Limbo : 57 : 619 Median : 0.600
## Solo : 56 Plot Unavailable. : 548 Mean : 1.346
## Alone : 55 Hong Kong movie : 450 3rd Qu.: 0.912
## The Box: 50 Nikkatsu Roman Porno: 356 Max. :2994.357
## (Other):1002131 (Other) :817346
## poster_path tagline
## :276670 :858506
## None : 598 English : 165
## /wtoKLMm4UvkwvcSwO3XWcs1gJuF.jpg: 54 animation short : 104
## /cWjdh8VTiizYfQp5m6fJi4PDy8w.jpg: 48 100% Lesbian : 73
## /je3JbUs3OEoYkS6Vd7iv7w6HUPu.jpg: 45 Documentary : 60
## /qpXweJ0Gbl5OmYZqzNWtDJovF8e.jpg: 41 AN INSIDE NO.9 FILM: 46
## (Other) :725046 (Other) :143548
## genres production_companies
## :371589 :527616
## Documentary:124523 Evil Angel : 2805
## Drama : 99077 ONF | NFB : 2130
## Comedy : 55629 BBC : 2025
## Animation : 28941 Metro-Goldwyn-Mayer: 2025
## Music : 23773 Columbia Pictures : 1927
## (Other) :298970 (Other) :463974
## production_countries spoken_languages
## :409960 :399804
## United States of America:165664 English :213258
## Japan : 37093 Japanese : 36555
## Germany : 31611 French : 33940
## United Kingdom : 31258 Spanish : 33645
## France : 30628 No Language: 26900
## (Other) :296288 (Other) :258400
Excellent, we now have a somewhat clearer overview of the database for our subsequent analysis.
Nevertheless, we don’t require as many variables for our primary goal; let’s choose only the most pertinent variables and data from 2018 onwards.
TMDB_data_pre$release_date <- TMDB_data$release_date
TMDB_data <- TMDB_data_pre
TMDB_data <- TMDB_data %>%
select(id, title, release_date, status, runtime,popularity, vote_average, vote_count, revenue, budget, production_companies, genres, original_language, adult) %>%
filter(release_date > 2018)
TMDB_data$release_date <- as.Date(TMDB_data$release_date, format = "%Y-%m-%d")
TMDB_data$year <- year(TMDB_data$release_date)
TMDB_data$month <- month(TMDB_data$release_date, label = TRUE)
TMDB_data %>%
filter(vote_average > 7) %>%
arrange(desc(popularity)) %>%
head()
We’ve pruned unnecessary variables and filtered our updated database to include only films released after 2018, focusing on the past 5 years.
Let’s keep in mind our goal: to identify the cream of the crop! To do so, we’ll narrow our focus to feature-length films with substantial budgets, relying on popularity and average ratings to pinpoint the top contenders.
TMDB_data <- TMDB_data %>%
filter(runtime > 60) # para excluir cortometrajes
TMDB_data <- TMDB_data %>% # para excluir la gran mayoria de peliculas con bajo presupesto
filter(budget > 100000) %>%
mutate(revenue_in_million = round(revenue/1000000, 3), budget_in_million = round(budget/1000000,3))
TMDB_data %>%
select(-c("revenue", "budget"))
TMDB_data %>%
arrange(budget_in_million)
TMDB_data <- TMDB_data %>%
filter(vote_average > 7) %>%
arrange(desc(popularity))
glimpse(TMDB_data)
## Rows: 738
## Columns: 18
## $ id <int> 565770, 980489, 1008042, 385687, 678512, 346698, …
## $ title <fct> "Blue Beetle", "Gran Turismo", "Talk to Me", "Fas…
## $ release_date <date> 2023-08-16, 2023-08-09, 2023-07-26, 2023-05-17, …
## $ status <fct> Released, Released, Released, Released, Released,…
## $ runtime <int> 128, 135, 95, 142, 131, 114, 102, 118, 124, 127, …
## $ popularity <dbl> 2994.357, 2680.593, 1458.514, 1175.267, 1111.036,…
## $ vote_average <dbl> 7.139, 8.068, 7.214, 7.265, 7.973, 7.279, 7.757, …
## $ vote_count <int> 1023, 702, 973, 3881, 503, 5074, 2467, 56, 21, 32…
## $ revenue <dbl> 124818235, 114800000, 72600000, 704709660, 212587…
## $ budget <int> 120000000, 60000000, 4500000, 340000000, 15000000…
## $ production_companies <fct> "Warner Bros. Pictures, The Safran Company, DC Fi…
## $ genres <fct> "Action, Science Fiction, Adventure", "Action, Dr…
## $ original_language <fct> en, en, en, en, en, en, en, en, en, en, en, en, e…
## $ adult <fct> False, False, False, False, False, False, False, …
## $ year <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
## $ month <ord> ago, ago, jul, may, jul, jul, jun, sep, sep, jun,…
## $ revenue_in_million <dbl> 124.818, 114.800, 72.600, 704.710, 212.587, 1428.…
## $ budget_in_million <dbl> 120.0, 60.0, 4.5, 340.0, 15.0, 145.0, 200.0, 13.0…
Amazingly, after completing all the data cleaning and transformation,
we’re left with just 739 movies out of over
1 million movies.
With the necessary data at hand, we delve into an in-depth analysis of top movies, examining each variable meticulously.
ggplot(TMDB_data, aes(year))+
geom_histogram(binwidth = 1, fill = "#67C6E3", color = "black") +
labs(title = "Total Number of Top Movies Released from 2018 to 2024", x = "year", y = "Number of Movies")
ggplot(TMDB_data, aes(x = month)) +
geom_bar(fill = "#67C6E3") +
theme(axis.text.x = element_text(angle = 90, vjust = -0.01)) +
facet_wrap(~year) +
labs(title = "Total Number of Top Movies Released by Year and Month (2018-2024)", x = "Month", y = "Number of Movies")
ggplot(TMDB_data, aes(runtime)) +
geom_histogram(fill = "#67C6E3") +
labs(title = "Movie Runtime Distribution", x = "Runtime (in minutes)", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
TMDB_data %>%
filter(runtime > 400)
We observe a predominant trend towards movies with a duration of approximately 100 minutes, translating to around 1 hour and 40 minutes. This duration seems to be the favored choice among filmmakers for captivating storytelling.
ggplot(TMDB_data, aes(popularity)) +
geom_histogram(fill = "#67C6E3") +
labs(title = "Movie Popularity Distribution", x = "Popularity", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
TMDB_data %>%
filter(popularity > 2000)
Upon closer examination of the dataset, it becomes evident that the majority of movies fall into the realm of relative obscurity, marked by a left-skewed distribution in popularity scores. However, two outliers, “Gran Turismo” and “Blue Beetle,” stand out as exceptionally popular among audiences, defying the norm.
ggplot(TMDB_data, aes(vote_average)) +
geom_histogram(fill = "#67C6E3")+
labs(title = "Movie Vote Average Distribution", x = "Vote Average", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This histogram exhibits a slight left-skew, with scores predominantly clustered in the lower range, typically between 7 and 8 out of 10.
ggplot(TMDB_data, aes(vote_count)) +
geom_histogram(fill = "#67C6E3")+
xlim(0,27714)+
ylim(0,100) +
labs(title = "Movie Vote Distribution", x = "Vote", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (`geom_bar()`).
TMDB_data %>%
filter(vote_count > 20000)
We observe a left-skewed distribution, similar to that of popularity, indicating that the majority of movies in our dataset receive relatively few votes. This trend suggests that these films may struggle to gain traction due to their limited visibility, making it difficult for TMDB users to discover these lesser-known gems. However, it’s noteworthy to mention the top four highest-rated movies, two of which hail from the Marvel universe:
Avengers: Infinity War: Garnering 27,713 votes with
an impressive average rating of 8.25
Avengers: Endgame: Accumulating 23,857 votes with a
closely comparable average rating of 8.26
Joker: Amassing 23,425 votes with a commendable
average rating of 8.16
Black Panther: Securing 21,053 votes with a
respectable average rating of 7.39
ggplot(TMDB_data, aes(revenue_in_million)) +
geom_histogram(fill = "#67C6E3")+
labs(title = "Movie Revenue Distribution in million", x = "Revenue (in million)", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Similarly, we witness a left-skewed distribution in this histogram, indicative of the complexities associated with precisely measuring the true earnings of films released via online platforms. The prevalence of zero scores within our dataset underscores the inherent difficulty in assessing their financial performance accurately. Additionally, it’s noteworthy that the majority…
ggplot(TMDB_data, aes(budget_in_million)) +
geom_histogram(fill = "#67C6E3")+
labs(title = "Movie Budget Distribution in million", x = "Budget (in million)", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
TMDB_data %>%
filter(budget_in_million > 300)
We continue to notice a left-skewed pattern in the histogram,
suggesting that most films have budgets below the one-million-dollar
threshold, typically falling within the range of $100,000
to $1,000,000. Notably, the three films commanding the
highest budgets are:
Avatar: The Way of Water: $460 million
Avengers: Endgame: $356 million
Fast & Furious X: $340 million
TMDB_data_earnings_losses <- TMDB_data %>%
filter(revenue_in_million > 0) %>%
mutate(earnings_and_losses = revenue_in_million-budget_in_million)
TMDB_data_earnings_losses <- TMDB_data_earnings_losses %>%
mutate(earn_of_lose = ifelse(earnings_and_losses >= 0, "Earnings", "Losses"), year = year(release_date))
TMDB_data_average_gains_and_losses <- TMDB_data_earnings_losses %>%
group_by(year, earn_of_lose) %>%
summarize(mean_earn_lose = round(mean(earnings_and_losses), 3))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
ggplot(TMDB_data_average_gains_and_losses, aes(year, mean_earn_lose, fill = earn_of_lose))+
geom_bar(stat = "identity")+
scale_fill_manual(values = c("Earnings" = "#67C6E3", "Losses" = "#FF6969")) +
geom_text(aes(label = mean_earn_lose, y = ifelse(earn_of_lose == "Earnings", mean_earn_lose * 1.05, mean_earn_lose * 1.4)), color = "black") +
labs(title = "Stacked Bar Chart of Average Gains and Losses by Year (2018-2024)", x = "", y = "mean_(in million)", fill = "earn/lose")
We can discern that 2019 emerged as a standout year, boasting the
highest average profits per film, exceeding $232 million.
However, it also faced losses averaging around
$11 million per film. Undoubtedly, it marked a significant
milestone for the cinema industry. Conversely, the subsequent year
witnessed a substantial downturn in profits among top-performing movies,
with earnings hovering around $77 million and losses
escalating to $23 million per film. While this figure may
appear substantial to the average viewer, within the cinematic realm, it
signifies a considerable setback, particularly considering these films’
high ratings and widespread popularity aimed at societal impact. It’s
crucial to acknowledge that 2020 bore the brunt of the COVID-19
pandemic, with global lockdowns likely contributing to the decline in
profits for most films.
Following that challenging period, subsequent years showed promise,
with profits more than doubling in 2021 compared to 2020, reaching
approximately $170 to $210 million.
TOP3_language <- TMDB_data_earnings_losses %>%
filter(!original_language == "cn") %>%
group_by(original_language) %>%
summarize(mean_earnings = mean(earnings_and_losses)) %>%
arrange(desc(mean_earnings)) %>%
head(3)
ggplot(TOP3_language, aes(reorder(original_language, mean_earnings), mean_earnings))+
geom_bar(stat = "identity", fill = "#67C6E3")+
coord_flip()+
labs(title = "Top 3 Languages by Average Revenue", x = "", y = "Mean revenue (in million)")
Impressively, Asian films command a substantial presence in the film industry. Specifically, a standout Chinese (zh) production released between 2018 and 2024 boasts an average profit nearing $300 million. This highlights the growing influence and lucrative nature of Asian cinema on a worldwide scale.